Although it is the English edition, the content is rich and thorough and well worth reading. Start with the full table of contents below; a note at the end explains how to get the book.
Part I. Fundamentals of Stream Processing with Apache Spark
1. Introducing Stream Processing
What Is Stream Processing?
Batch Versus Stream Processing
The Notion of Time in Stream Processing
The Factor of Uncertainty
Some Examples of Stream Processing
Scaling Up Data Processing
MapReduce
The Lesson Learned: Scalability and Fault Tolerance
Distributed Stream Processing
Stateful Stream Processing in a Distributed System
Introducing Apache Spark
The First Wave: Functional APIs
The Second Wave: SQL
A Unified Engine
Spark Components
Spark Streaming
Structured Streaming
Where Next?
2. Stream-Processing Model
Sources and Sinks
Immutable Streams Defined from One Another
Transformations and Aggregations
Window Aggregations
Tumbling Windows
Sliding Windows
Stateless and Stateful Processing
Stateful Streams
An Example: Local Stateful Computation in Scala
A Stateless Definition of the Fibonacci Sequence as a Stream Transformation
Stateless or Stateful Streaming
The Effect of Time
Computing on Timestamped Events
Timestamps as the Provider of the Notion of Time
Event Time Versus Processing Time
Computing with a Watermark
Summary
3. Streaming Architectures
Components of a Data Platform
Architectural Models
The Use of a Batch-Processing Component in a Streaming Application
Referential Streaming Architectures
The Lambda Architecture
The Kappa Architecture
Streaming Versus Batch Algorithms
Streaming Algorithms Are Sometimes Completely Different in Nature
Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
Summary
4. Apache Spark as a Stream-Processing Engine
The Tale of Two APIs
Spark’s Memory Usage
Failure Recovery
Lazy Evaluation
Cache Hints
Understanding Latency
Throughput-Oriented Processing
Spark’s Polyglot API
Fast Implementation of Data Analysis
To Learn More About Spark
Summary
5. Spark’s Distributed Processing Model
Running Apache Spark with a Cluster Manager
Examples of Cluster Managers
Spark’s Own Cluster Manager
Understanding Resilience and Fault Tolerance in a Distributed System
Fault Recovery
Cluster Manager Support for Fault Tolerance
Data Delivery Semantics
Microbatching and One-Element-at-a-Time
Microbatching: An Application of Bulk-Synchronous Processing
One-Record-at-a-Time Processing
Microbatching Versus One-at-a-Time: The Trade-Offs
Bringing Microbatch and One-Record-at-a-Time Closer Together
Dynamic Batch Interval
Structured Streaming Processing Model
The Disappearance of the Batch Interval
6. Spark’s Resilience Model
Resilient Distributed Datasets in Spark
Spark Components
Spark’s Fault-Tolerance Guarantees
Task Failure Recovery
Stage Failure Recovery
Driver Failure Recovery
Summary
Part II. Structured Streaming
7. Introducing Structured Streaming
First Steps with Structured Streaming
Batch Analytics
Streaming Analytics
Connecting to a Stream
Preparing the Data in the Stream
Operations on Streaming Dataset
Creating a Query
Start the Stream Processing
Exploring the Data
Summary
8. The Structured Streaming Programming Model
Initializing Spark
Sources: Acquiring Streaming Data
Available Sources
Transforming Streaming Data
Streaming API Restrictions on the DataFrame API
Sinks: Output the Resulting Data
format
outputMode
queryName
option
options
trigger
start()
Summary
9. Structured Streaming in Action
Consuming a Streaming Source
Application Logic
Writing to a Streaming Sink
Summary
10. Structured Streaming Sources
Understanding Sources
Reliable Sources Must Be Replayable
Sources Must Provide a Schema
Available Sources
The File Source
Specifying a File Format
Common Options
Common Text Parsing Options (CSV, JSON)
JSON File Source Format
CSV File Source Format
Parquet File Source Format
Text File Source Format
The Kafka Source
Setting Up a Kafka Source
Selecting a Topic Subscription Method
Configuring Kafka Source Options
Kafka Consumer Options
The Socket Source
Configuration
Operations
The Rate Source
Options
11. Structured Streaming Sinks
Understanding Sinks
Available Sinks
Reliable Sinks
Sinks for Experimentation
The Sink API
Exploring Sinks in Detail
The File Sink
Using Triggers with the File Sink
Common Configuration Options Across All Supported File Formats
Common Time and Date Formatting (CSV, JSON)
The CSV Format of the File Sink
The JSON File Sink Format
The Parquet File Sink Format
The Text File Sink Format
The Kafka Sink
Understanding the Kafka Publish Model
Using the Kafka Sink
The Memory Sink
Output Modes
The Console Sink
Options
Output Modes
The Foreach Sink
The ForeachWriter Interface
TCP Writer Sink: A Practical ForeachWriter Example
The Moral of This Example
Troubleshooting ForeachWriter Serialization Issues
12. Event Time–Based Stream Processing
Understanding Event Time in Structured Streaming
Using Event Time
Processing Time
Watermarks
Time-Based Window Aggregations
Defining Time-Based Windows
Understanding How Intervals Are Computed
Using Composite Aggregation Keys
Tumbling and Sliding Windows
Record Deduplication
Summary
13. Advanced Stateful Operations
Example: Car Fleet Management
Understanding Group with State Operations
Internal State Flow
Using MapGroupsWithState
Using FlatMapGroupsWithState
Output Modes
Managing State Over Time
Summary
14. Monitoring Structured Streaming Applications
The Spark Metrics Subsystem
Structured Streaming Metrics
The StreamingQuery Instance
Getting Metrics with StreamingQueryProgress
The StreamingQueryListener Interface
Implementing a StreamingQueryListener
15. Experimental Areas: Continuous Processing and Machine Learning
Continuous Processing
Understanding Continuous Processing
Using Continuous Processing
Limitations
Machine Learning
Learning Versus Exploiting
Applying a Machine Learning Model to a Stream
Example: Estimating Room Occupancy by Using Ambient Sensors
Online Training
Part III. Spark Streaming
16. Introducing Spark Streaming
The DStream Abstraction
DStreams as a Programming Model
DStreams as an Execution Model
The Structure of a Spark Streaming Application
Creating the Spark Streaming Context
Defining a DStream
Defining Output Operations
Starting the Spark Streaming Context
Stopping the Streaming Process
Summary
17. The Spark Streaming Programming Model
RDDs as the Underlying Abstraction for DStreams
Understanding DStream Transformations
Element-Centric DStream Transformations
RDD-Centric DStream Transformations
Counting
Structure-Changing Transformations
Summary
18. The Spark Streaming Execution Model
The Bulk-Synchronous Architecture
The Receiver Model
The Receiver API
How Receivers Work
The Receiver’s Data Flow
The Internal Data Resilience
Receiver Parallelism
Balancing Resources: Receivers Versus Processing Cores
Achieving Zero Data Loss with the Write-Ahead Log
The Receiverless or Direct Model
Summary
19. Spark Streaming Sources
Types of Sources
Basic Sources
Receiver-Based Sources
Direct Sources
Commonly Used Sources
The File Source
How It Works
The Queue Source
How It Works
Using a Queue Source for Unit Testing
A Simpler Alternative to the Queue Source: The ConstantInputDStream
The Socket Source
How It Works
The Kafka Source
Using the Kafka Source
How It Works
Where to Find More Sources
20. Spark Streaming Sinks
Output Operations
Built-In Output Operations
saveAsxyz
foreachRDD
Using foreachRDD as a Programmable Sink
Third-Party Output Operations
21. Time-Based Stream Processing
Window Aggregations
Tumbling Windows
Window Length Versus Batch Interval
Sliding Windows
Sliding Windows Versus Batch Interval
Sliding Windows Versus Tumbling Windows
Using Windows Versus Longer Batch Intervals
Window Reductions
reduceByWindow
reduceByKeyAndWindow
countByWindow
countByValueAndWindow
Invertible Window Aggregations
Slicing Streams
Summary
22. Arbitrary Stateful Streaming Computation
Statefulness at the Scale of a Stream
updateStateByKey
Limitation of updateStateByKey
Performance
Memory Usage
Introducing Stateful Computation with mapWithState
Using mapWithState
Event-Time Stream Computation Using mapWithState
23. Working with Spark SQL
Spark SQL
Accessing Spark SQL Functions from Spark Streaming
Example: Writing Streaming Data to Parquet
Dealing with Data at Rest
Using Join to Enrich the Input Stream
Join Optimizations
Updating Reference Datasets in a Streaming Application
Enhancing Our Example with a Reference Dataset
Summary
24. Checkpointing
Understanding the Use of Checkpoints
Checkpointing DStreams
Recovery from a Checkpoint
Limitations
The Cost of Checkpointing
Checkpoint Tuning
25. Monitoring Spark Streaming
The Streaming UI
Understanding Job Performance Using the Streaming UI
Input Rate Chart
Scheduling Delay Chart
Processing Time Chart
Total Delay Chart
Batch Details
The Monitoring REST API
Using the Monitoring REST API
Information Exposed by the Monitoring REST API
The Metrics Subsystem
The Internal Event Bus
Interacting with the Event Bus
Summary
26. Performance Tuning
The Performance Balance of Spark Streaming
The Relationship Between Batch Interval and Processing Delay
The Last Moments of a Failing Job
Going Deeper: Scheduling Delay and Processing Delay
Checkpoint Influence in Processing Time
External Factors that Influence the Job’s Performance
How to Improve Performance?
Tweaking the Batch Interval
Limiting the Data Ingress with Fixed-Rate Throttling
Backpressure
Dynamic Throttling
Tuning the Backpressure PID
Custom Rate Estimator
A Note on Alternative Dynamic Handling Strategies
Caching
Speculative Execution
Part IV. Advanced Spark Streaming Techniques
27. Streaming Approximation and Sampling Algorithms
Exactness, Real Time, and Big Data
Exactness
Real-Time Processing
Big Data
The Exactness, Real-Time, and Big Data Triangle
Big Data and Real Time
Approximation Algorithms
Hashing and Sketching: An Introduction
Counting Distinct Elements: HyperLogLog
Role-Playing Exercise: If We Were a System Administrator
Practical HyperLogLog in Spark
Counting Element Frequency: Count Min Sketches
Introducing Bloom Filters
Bloom Filters with Spark
Computing Frequencies with a Count-Min Sketch
Ranks and Quantiles: T-Digest
T-Digest in Spark
Reducing the Number of Elements: Sampling
Random Sampling
Stratified Sampling
28. Real-Time Machine Learning
Streaming Classification with Naive Bayes
streamDM Introduction
Naive Bayes in Practice
Training a Movie Review Classifier
Introducing Decision Trees
Hoeffding Trees
Hoeffding Trees in Spark, in Practice
Streaming Clustering with Online K-Means
K-Means Clustering
Online Data and K-Means
The Problem of Decaying Clusters
Streaming K-Means with Spark Streaming
Part V. Beyond Apache Spark
29. Other Distributed Real-Time Stream Processing Systems
Apache Storm
Processing Model
The Storm Topology
The Storm Cluster
Compared to Spark
Apache Flink
A Streaming-First Framework
Compared to Spark
Kafka Streams
Kafka Streams Programming Model
Compared to Spark
In the Cloud
Amazon Kinesis on AWS
Microsoft Azure Stream Analytics
Apache Beam/Google Cloud Dataflow
30. Looking Ahead
Stay Plugged In
Seek Help on Stack Overflow
Start Discussions on the Mailing Lists
Attend Conferences
Attend Meetups
Read Books
Contributing to the Apache Spark Project
Five PDFs are available; reply "ss" in the official account's backend to get them.
Previous recommendations:
Spark Core: An Analysis of Shuffle
When Data Models Can't Be Reused, the Root Cause Is Design
Data Warehouses, Data Lakes, and Unified Batch/Stream Processing, Finally Explained Clearly!
How to Manage Sprawling Data Metrics in a Unified Way?
Notes on the 20-Lecture Practical Project Management Course (NetEase, Lei Beibei)
Key Goals and Technical Implementation of a Metadata Center
Hive Coding Conventions That Help with Tuning
Exploring HBase Internals: The Data Model
Exploring HBase Internals: How HBase Stores Data
Exploring HBase Internals: The Journey of a KeyValue
How Exactly Do You Build a Data Middle Platform?
What Kind of Company Actually Needs a Data Middle Platform?
Is the Data Middle Platform Really the Next Stop for Big Data?